[WIP] refactor of dataset builder and executor #537
base: main
Conversation
… space in file name
# Validate conversation structure
for item in dataset:
    turns = self._parse_turns(item['text'])
These classes are still in progress, right? Do they need to be updated or implemented later?
The dataset format for conversations is described here.
MAX_SAMPLE_SIZE = 1000
if isinstance(dataset, NestedDataset):
    sample_size = min(MAX_SAMPLE_SIZE, len(dataset))
    sample = dataset.select(range(sample_size))
For HF datasets, we can use the `dataset.take(n)` method to get the top-n samples for higher efficiency. Related doc
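The efficiency point here is that a lazy `take(n)` stops reading after n rows, while `select(range(n))` first works against the fully materialized dataset. A minimal stdlib sketch of the same trade-off (the generator and names below are illustrative, not data-juicer or HF `datasets` APIs):

```python
from itertools import islice

def row_source(n_total):
    """Simulates a lazily-read dataset: rows are produced on demand."""
    for i in range(n_total):
        yield {"id": i, "text": f"sample {i}"}

# Eager approach (analogous to select(range(n))): materialize every row,
# then slice -- wasteful when the dataset is large.
eager_sample = list(row_source(100_000))[:3]

# Lazy approach (analogous to dataset.take(n)): stop reading after n rows.
lazy_sample = list(islice(row_source(100_000), 3))
```

Both yield the same three rows, but the lazy variant never touches the remaining 99,997.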
def load_data(self, **kwargs):
    dataset = rd.read_json(self.ds_config['path'])
Use `RayDataset.read_json()` instead to support stream reading for JSON files. Ref:
data-juicer/data_juicer/core/ray_data.py
Lines 198 to 207 in 449cac1
@classmethod
def read_json(cls, paths: Union[str, List[str]]) -> RayDataset:
    # Note: a temp solution for reading json stream
    # TODO: replace with ray.data.read_json_stream once it is available
    import pyarrow.json as js
    try:
        js.open_json
        return read_json_stream(paths)
    except AttributeError:
        return rd.read_json(paths)
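The snippet above uses attribute-based feature detection: probe for `pyarrow.json.open_json` and fall back when the installed pyarrow lacks it. The same idiom with only stdlib modules (`loads_fast` is a hypothetical attribute used purely to trigger the fallback path):

```python
import json

def parse_json(text):
    # Probe for an attribute that only some versions provide, mirroring the
    # `js.open_json` check. `loads_fast` does not exist in the stdlib json
    # module, so the AttributeError branch runs and json.loads is used.
    try:
        json.loads_fast  # raises AttributeError when absent
        return json.loads_fast(text)
    except AttributeError:
        return json.loads(text)

result = parse_json('{"path": "data.jsonl"}')
```

This keeps one call site working across library versions without pinning a minimum dependency.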
def load_data(self, **kwargs):
    raise NotImplementedError(
        'Huggingface data load strategy is not implemented')
'Huggingface data load strategy for Ray is not implemented'
@@ -86,7 +36,8 @@ def __init__(self,
                 dataset: rd.Dataset,
                 dataset_path: str = None,
                 cfg=None) -> None:
-        self.data = preprocess_dataset(dataset, dataset_path, cfg)
+        self.data = dataset
+        # self.data = preprocess_dataset(dataset, dataset_path, cfg)
Is `preprocess_dataset` necessary? @pan-x-c
import pandas as pd
import regex as re
import requests
from bs4 import BeautifulSoup
Add `bs4` to the minimal requirements.
# The iterator and extractor code are in large part taken
# from the Red-Pajama repo
# https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/arxiv
# implementation of the Wikipedia dataset preparation:
# https://github.com/huggingface/datasets/blob/7e30308f49f8c85dc7a2ab5aafbff04b5d2f38e2/datasets/wikipedia/wikipedia.py

MEDIA_ALIASES = {
Why not import them from datasets?
WORK_DIR = os.path.dirname(os.path.realpath(__file__))


@SKIPPED_TESTS.register_module()
Add a comment describing the reason for skipping this test.
def test_rewrite_cli_datapath_local_single_file(self):
    dataset_path = "./data/sample.txt"
Setting the path relative to the current file path (`WORK_DIR`) makes the test easier for readers to trace. Ref:
data-juicer/tests/ops/filter/test_audio_duration_filter.py
Lines 12 to 16 in b91683b
data_path = os.path.join(os.path.dirname(os.path.realpath(__file__)), '..',
                         'data')
aud1_path = os.path.join(data_path, 'audio1.wav')  # about 6s
aud2_path = os.path.join(data_path, 'audio2.wav')  # about 14s
aud3_path = os.path.join(data_path, 'audio3.ogg')  # about 1min59s
Key elements of this PR:
a. Support ModelScope
b. Support arXiv: download, decompression, and ingestion
c. Support Wikipedia: download, decompression, and ingestion
d. Support Common Crawl: download, decompression, and ingestion
design doc: https://aliyuque.antfin.com/yilei.z/cnk4dn/qomvqql62lyglrh2?singleDoc# "Refactor Design of Dataset/Loader/Executor"